STATEMENT OF FOCUS



The Wisconsin Research and Development Center for Cognitive Learning 
focuses on contributing to a better understanding of cognitive learning by 
children and youth and to the improvement of related educational practices. 

The strategy for research and development is comprehensive. It includes 
basic research to generate new knowledge about the conditions and processes 
of learning and about the processes of instruction, and the subsequent development
of research-based instructional materials, many of which are designed for
use by teachers and others for use by students. These materials are tested and
refined in school settings. Throughout these operations behavioral scientists,
curriculum experts, academic scholars, and school people interact, insuring 
that the results of Center activities are based soundly on knowledge of subject 
matter and cognitive learning and that they are applied to the improvement of 
educational practice.

This Technical Report is from the Concepts in Verbal Argument Project in 
Program 2. General objectives of the Program are to establish rationale and 
strategy for developing instructional systems, to identify sequences of concepts
and cognitive skills, to develop assessment procedures for those concepts
and skills, to identify or develop instructional materials associated
with the concepts and cognitive skills, and to generate new knowledge about
instructional procedures. Contributing to these Program objectives, the staff
of the project developed a semiprogramed course in verbal argument and related
tests for use at the high school level. The project staff prepared the materials
on the basis of an outline of concepts and critical skills developed from an 
evaluation of everyday discourse. 
ABSTRACT



This paper reports the development of a test battery for measuring student 
mastery of certain verbal skills basic to critical thinking. The battery, col- 
lectively entitled The Wisconsin Tests of Testimony and Reasoning Assessment, 
consists of three tests related to testimony and four tests related to arguments 
developed through reasoning. The basic rationale for the tests and major con- 
siderations in test development are explained. Each test is described and ap- 
propriate reliability estimates and item statistics are presented. The particular 
data presented were gathered by an administration of the fifth edition of the 
test battery to over 3,000 junior/senior high students in four Wisconsin school
systems.



I 

INTRODUCTION 



This report presents an overview of research 
related to the development of a test battery for 
measuring student mastery of certain verbal 
skills basic to critical thinking. The report 
provides (1) a discussion of the rationale for 
and purposes of the tests, (2) a discussion of 
the development of the tests, and (3) a discussion
of each test including a description of the
test, reliability and item data, and a brief dis- 
cussion of that data. The following discussion, 
then, is intended strictly as a report of test 
development; it is not offered as a guide to 
the use or interpretation of the tests. 



II 

RATIONALE AND PURPOSES 



In developing the Wisconsin Tests of Testimony
and Reasoning Assessment (WISTTRA) the
researchers sought (1) to create a valid and 
reliable testing instrument to be generally 
available for assessing student development 
in the mastery of relevant concepts and skills 
of verbal argument for Grades 10-12, and
(2) to gather data from administrations of this
battery useful in the development of related
instructional materials. The tests have been
published by, and a sample copy is available
from, the Wisconsin Research and Development
Center for Cognitive Learning (Allen, Feezel,
& Kauffeld, 1968). For a more complete statement
of the project's rationale the reader should
consult A Taxonomy of Concepts and Critical
Abilities Related to the Evaluation of Verbal
Argument (Allen, Feezel, & Kauffeld, 1967).

WISTTRA is based on a view of verbal argu- 
ment articulated by the English philosopher 
Stephen Toulmin (1958) and adapted to the field 
of ordinary argument by the investigators. His
program, an off-shoot of Rylean language philosophy,
is developed on two central points:
(1) the habits of reasoning utilized in any field
of inquiry involve rules for evaluating inferences
much richer than the field-invariant
schemes worked out by formal logicians, and
(2) an adequate account of such rules can only
be worked out by attending to the nature of
particular fields of inquiry. The tests discussed
here grew out of a definition of rules
of inference fundamentally important to the
field of ordinary, i.e., nontechnical, argument.

In order to characterize the rules of infer- 
ence appropriate to ordinary arguments the 
researchers first isolated three major require- 
ments imposed by the nature of plain discourse 
on ordinary reasoning: (1) ordinary arguments
must be able to take the reports of other people
(testimony) as an important source of primary
data, (2) ordinary arguments must be
able to provide reasons for a wide variety of
claim types, and (3) ordinary arguments must be
able to handle inferences which utilize categories
built on multidimensional, loosely related
configurations of criteria.

On this foundation two orders of concepts 
were distinguished: (1) those related to appraising
the testimony of others and (2) those
related to appraising the strength of reasons
given for a claim. Concepts used in appraising
testimony may be grouped into two clusters:
(a) internal tests (position to observe, ability
to observe, bias, and qualification for judging)
and (b) external tests (primary as compared
with secondary information, recent as compared
with dated information, and consistent as compared
with inconsistent information). Concepts
used in assessing the strength of reasons may
be grouped into two clusters: (a) those related
to the structure of arguments (data, warrant,
claim, and reservation) and (b) rules for assessing
arguments developed through reasoning.
The latter are sensitive to the type of argument 
being assessed and are represented in the test 
battery as they are used to assess sign, class, 
causal, alternative, parallel case, comparative, 
and warrant supportive arguments. Basic skills 
used in assessing arguments developed through 
reasoning include (1) the ability to detect mis- 
sing parts of an argument, (2) the ability to 
discern the relevance of objections, and (3) the 
ability to recognize appropriate conclusions. 

The researchers saw a compelling need for 
a test battery adapted to just this configuration 
of concepts and skills, because measuring in- 
struments developed on the assumption that 
rules of inference are field-invariant do not 
assess the student's mastery of the skills and 
concepts appropriate to ordinary argument. 
Commonly, tests based on field-invariant logics 
simply measure the student's mastery of the 
rules of inference appropriate to some preferred 
field of specialized inquiry. Such tests are 
useful when information about the student's
mastery of the reasoning habits of the preferred
field is of interest, but they may give a very
distorted picture of a student’s ability to handle 
everyday arguments. 

In particular, tests of reasoning based on 
field-invariant logics usually neglect the con- 
cepts and skills related to assessing testimony 
and discerning the relevance of an objection.
Tests based on the highly mechanical proce- 
dures for induction and deduction prescribed 
by type logics are particularly vulnerable to 
this criticism. Few ordinary arguments involve
questions which can be resolved by direct observations
of the participants, and still fewer
involve questions which can be fully analyzed 
against the tidy categories such systems re- 
quire. WISTTRA was developed to assess the 
student's ability to evaluate adequacy of testi- 
mony and to recognize the structure that is 
present in ordinary arguments and raise perti- 
nent objections based on the rules of inference 
appropriate to that structure. 
III

DEVELOPMENT OF THE TESTS 



EDITIONS 

The development of WISTTRA constitutes 
one phase of research related to the develop- 
ment of student abilities in the assessment of 
verbal arguments. From the project's incep- 
tion the researchers recognized the need for 
an appropriate testing instrument. Work on 
the tests was begun in February of 1966 and 
continued through April of 1968. During that 
period the instrument went through four experi- 
mental editions in which its focus was narrowed 
from Grades 7-12 to Grades 10-12 and its items 
analyzed and revised for greater precision and 
reliability. During that period portions of the
battery were pretested on four occasions and
a normative study was conducted with a fifth
edition of the battery. (See Table 1.)

Development of the tests will be discussed 
in terms of the considerations and criteria which 
the investigators used in decisions related to 
test instructions, test vocabulary, student 
interest, subject matter of items, content
validity, internal-consistency reliability, and
item characteristics. 



TEST INSTRUCTIONS 

As is often the case, drawing up instructions 
for the various tests in the battery required 
balancing the need to provide sufficient infor- 
mation to complete the task against the demand 
that test instructions not teach the student 
skills the test seeks to measure. In order to 
minimize the confounding effects of test instructions,
two forms of instructions were used
in Pretest One. The two forms differed only in 
that Form A included an example of the response 
task while Form B did not. The two instructional 
forms did not yield significant differences in 
student responses, but from general indications 
and conversations with students the longer form 
was selected for use in all later test administrations.
Care was exercised that the task example
not reveal the nature of the cognitive
skill which the test is to measure. Comments 
on the clarity and interest of all instructions 
were obtained from a panel of high school stu- 
dents (details on the composition of this panel 
are reported under STUDENT INTEREST) and revisions
were made according to this feedback.



TEST VOCABULARY 

The test battery is not intended as a measure 
of reading skills or of vocabulary development. 
To minimize confounding due to such factors,
items were screened for words not available
in an average ninth grader's vocabulary
(Thorndike-Lorge, 1944). In addition, Dale-Chall
(1948) readability scores were computed
for selected portions of the battery. Scores
ranged from 7.5 to 8.2, indicating that test
items are suitable for the average reading
ability of Grades 7 and 8. These steps do
not, of course, eliminate confounding due to
differences in reading skills, but they should
tend to minimize these differences insofar
as possible.1 In addition, they indicate
the battery's appropriateness for the intended
Grades (10-12).

STUDENT INTEREST 

Immediately following Pretest One students 
were asked to rate the testimony portions of 
the battery on seven-step interest, readability, 
and difficulty scales. Testimony I was rated
as quite readable and quite easy, while all 



1 The correlations of tests in this battery with 
various IQ and reading scores are available on 
request. 
Table 1

Test Administrations in the Development of WISTTRA

                 Tests               Place of          Date of       Sample     Educational                 Males     Females
                 administered        testing           testing       size       characteristics

Pretest One      TI, TII, TIII, RI   Madison, Wisc.    July, 1966    38         Students attending the      23        15
                                                                                Debate Program, Wisconsin
                                                                                High School Speech
                                                                                Institute; Grades 9, 10,
                                                                                and 11

Pretest Two      TI, TII, TIII, RI,  Monona, Wisc.     Sept., 1966   58         Tenth, eleventh, and        27        31
                 RII, RIII, RIV                                                 twelfth graders in an
                                                                                elective speech course at
                                                                                Monona Grove High School

Pretest Three    TI, TII, TIII, RI,  Lodi, Wisc.       Dec., 1966    187        Grades 7-12 of Lodi Junior  101       86
                 RII, RIII, RIV                                                 and Senior High Schools

Pretest Four     TI, RI, RII, RIII,  Juneau and        Nov., 1967    258        Grades 10-12 of the Juneau  123       135
                 RIV                 Reeseville, Wisc.                          and Reeseville High
                                                                                Schools

Normative Study  TI, TII, TIII, RI,  Clinton,          April, 1968   3090 to    Grades 7-12 of all four     1507 to   1583 to
                 RII, RIII, RIV      Cedarburg,                      3118 (a)   schools                     1515 (a)  1603 (a)
                                     Reedsburg, and
                                     Owen-Withee, Wisc.

(a) Variation due to student absenteeism during the testing period.



other ratings for the testimony tests were mod- 
erately readable, moderately easy, and mod- 
erately interesting. It should be remembered, 
however, that these students had received con- 
siderable instruction in argumentation and 
should have found the tests less challenging 
than students without special training in the 
area. 

Item interest was also discussed with a 
panel of five high school sophomores who had 
no previous speech or argumentation course 
work. The panel was selected on the basis of 
Henmon-Nelson IQ and SCAT Reading Scores
to represent a range of abilities at that grade 
level. These are given in Table 2. Comments 
by the student consultants were considered 
during subsequent revisions of the test items. 



Table 2

Panel of Student Consultants

Student    Henmon-Nelson    SCAT
           (IQ)             (Percentile)

A          121              99
B          121              94-99
C          143              99
D          107              70-83
E          112              85-96

SUBJECT MATTER OF ITEMS 

Items were constructed using commonplace 
information from the subject-matter areas of 



government, entertainment, and education. 
Fictional names of persons, places, and events 
were used where possible. Each test presents 
an approximate balance of items representing 
the three subject-matter areas. The data from 
Pretest One were examined for confounding due 
to subject area variables and revealed no significant
differences in scores among the three
areas. However, since items in these three 
areas deal with common topics of discussion, 
approximately equal representation of these 
subjects was retained. 



CONTENT VALIDITY 

As illustrated in Figure 1, WISTTRA was constructed
to measure cognitive skills related to
certain fundamental concepts of verbal argument. 
The three tests of testimony were designed to 
measure the student's ability to detect instances 
which violate common internal and external tests 
of testimony. The reasoning tests were designed 
to measure the student's ability to recognize the 
essential components of an argument, to ask
relevant questions about arguments, and to
draw correct conclusions from arguments.

Based upon pilot study information, subtests
for Testimony I and Testimony III were retained
as illustrated in Figure 1. The pilot study results
indicated that subtests need not be retained
for Testimony II and the four reasoning
tests. Further study of the dimensionality of
all the tests is in progress using factor analytic
procedures.

At two points in the development of the tests 
— before Pretest One and prior to the Normative 
Study — the battery was submitted to panels of 
experts in the field of argumentation trained in 
the conceptual basis of the instrument. On 
both occasions three-judge panels were used. 
Following a Q-sort technique the judges were 
asked to place items in relevant categories or 
in a 'cannot tell' category. Criteria for cate- 
gorizing items included (where relevant) argu- 
ment type, type of rule violated, statement 
type, and completeness of argument. Judge 
agreement ranged from 94.9 to 98.9% for the 
tests coded in the initial stages of develop- 
ment and from 85.4 to 98.4% for the tests used 
in the normative study. The decline in coder 
Each of seven warrant types (sign, class,
causal, alternative, parallel case, comparative,
and warrant-supportive) is represented
by four items in each reasoning test.

Figure 1. Relationship of WISTTRA to Concepts Identified
agreement is attributable to the fact that only
items which achieved high coder agreement 
were used in drawing up the first edition of 
the tests while the pool of items coded on the 
second occasion consisted of all items com- 
prising the normative edition of the tests. 



INTERNAL CONSISTENCY RELIABILITY 

Hoyt analysis of variance reliability esti- 
mates were obtained for all of the tests. This 
is an internal consistency measure of reliability 
and as such estimates consistency of perform- 
ance on a relatively homogeneous power test. 
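
As a rough illustration of what such an estimate involves, the sketch below computes a Hoyt-style reliability from a students-by-items score matrix by partitioning the sums of squares into students, items, and residual. It follows the standard textbook formulation and is not the code of the analysis program used in the project; the function names and the small data set are hypothetical. Coefficient alpha is computed alongside as a cross-check, since the two estimates are algebraically the same quantity.

```python
from statistics import variance


def hoyt_reliability(scores):
    """Hoyt analysis-of-variance reliability estimate.

    scores: one row per student, one column per item (1 = correct, 0 = wrong).
    Assumes at least two students, two items, and some between-student variance.
    """
    n = len(scores)       # number of students
    k = len(scores[0])    # number of items
    grand = sum(sum(row) for row in scores) / (n * k)
    student_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA without replication: total = students + items + residual.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_students = k * sum((m - grand) ** 2 for m in student_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_students - ss_items

    ms_students = ss_students / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_students


def coefficient_alpha(scores):
    """Coefficient alpha, used here only to cross-check the Hoyt estimate."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical responses: six students, four items.
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(hoyt_reliability(demo), 3), round(coefficient_alpha(demo), 3))
```

Because the estimate depends on the residual mean square relative to the between-student mean square, a test made up of divergent subtests will tend to show a lower value than a homogeneous test of the same length.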

Rigid standards for the interpretation of 
reliability estimates are not overly meaningful. 
As a rule of thumb, reliabilities of at least .80
are recommended for evaluating level of group
accomplishment and .90 for evaluating level
of individual accomplishment. In practice, however,
reliability estimates of .50 to .80 are
often treated as indicating an instrument
precise enough for group differentiation.
Thorndike and Hagen (1961) discuss this prob- 
lem in terms of the percent of times the direc- 
tion of difference will be reversed in subsequent 
testing for scores falling at the 75th and 50th
percentiles for various values of reliability
estimates. The security of a conclusion based
upon a particular test increases much more
rapidly for groups than it does for individuals
as the reliability of the test increases. For
example, the probability of reversal is one in 
three for scores of single individuals when the 
reliability is .50; the probability of reversal 
is 1 in 20 for means of groups of 25 when the 
reliability is .50. 

The standard error of measurement is a sec- 
ond index of test consistency. This is a meas- 
ure of the variability of the scores a subject 
would obtain on repeated measurements using 
the same test. The standard error of measure- 
ment indicates how much his obtained score 
for a single administration is likely to vary 
with repeated testing, i.e., how nearly "correct"
this obtained score is. For a student's
hypothetical distribution of repeated scores on
a test, his obtained score would fall within
one standard error value of his actual obtained
score about two-thirds of the time.

Another way to look at the interpretation of 
a reliability estimate is in terms of the size of 
the standard error of measurement relative to 
the standard deviation of test scores. This is
discussed by Thorndike (1951). If the reliability
is zero the standard error of measurement
would equal the standard deviation of the test. 
For reliability estimates of .80 and .90 the 
standard error of measurement is reduced to
only 45% and 32%, respectively, of its value for zero reliability.
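
The percentages quoted above follow from the usual classical relation between the standard error of measurement, the standard deviation of test scores, and the reliability estimate: SEM = SD x sqrt(1 - r). A minimal sketch of that arithmetic follows; the function names are hypothetical, and the standard deviation of 6.0 in the usage line is an invented value, not a figure from the battery.

```python
import math


def sem_ratio(reliability):
    """Standard error of measurement as a fraction of the score SD."""
    return math.sqrt(1.0 - reliability)


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement in raw-score units."""
    return sd * sem_ratio(reliability)


for r in (0.0, 0.50, 0.80, 0.90):
    print(f"reliability {r:.2f}: SEM is {sem_ratio(r):.0%} of the SD")

# Hypothetical test with a score SD of 6.0 points and reliability .80.
print(round(standard_error_of_measurement(6.0, 0.80), 2))
```

The relation also shows why gains in reliability pay off slowly at the low end: moving from .00 to .50 only reduces the standard error to about 71% of the standard deviation.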

Maximum reliabilities were sought for all 
tests but the researchers' expectations were 
conditioned by two considerations: (1) some
of the tests are composed of subtests sufficiently
divergent in character to reduce the
overall homogeneity of the total tests (TI and
TIII) and (2) some subtests are composed of
so few items that high reliabilities are not
likely (subtests of RI, RII, RIII, and RIV). For
these reasoning tests the total test reliability
should not have been affected by (2) except
that item data for the subtests were used in
selecting items for inclusion in the total test.
The purpose of this action was to enable the
researchers to obtain reliabilities for the four-item
subtests of a sufficient magnitude to enable
further study of the dimensionality of the
tests.

ITEM CHARACTERISTICS 

During the development of the tests, items 
were continually revised to improve the instru- 
ment on the basis of item characteristic data 
obtained from the GITAP item analysis program
(Baker 1966, 1968). This program provides
difficulty level, biserial correlation,
X50, and β statistics for each choice of each
item. In addition it gives descriptive statistics,
the standard error of measurement, and
the Hoyt reliability estimate for the total test.
Certain item characteristic criteria were used 
in selecting and refining items on the basis of 
the GITAP results. Items to be retained in a 
revised edition of the test had to meet the 
minimum requirement as given for each of the 
following criteria for the correct choice: 

1. Preferably fall within a middle difficulty
range as defined by Ebel (1965). See
Table 3.

2. Have a biserial correlation > .30.

3. Have an X50 between +2.00 and -2.00.

4. Have a β > .30.

In addition each incorrect choice had to meet
the following minimum requirements:

1. Have a reasonable minimum proportion
of subjects respond to it.

2. Have a biserial correlation < -.25 and
preferably < -.30.

3. Have an X50 lower than the X50 for the
correct choice.

4. Have a β < -.25 and preferably < -.30.

These criteria were established in consultation 
with staff of the R & D Center and on the basis 
of reasonably standard rules of thumb for item 
evaluation.
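
The retention rules listed above can be read as a simple screen applied to each choice of an item. The sketch below is one possible encoding, not the project's procedure: the class and function names are hypothetical, the report does not say what counts as a "reasonable minimum proportion" (the 0.05 default is an assumption), and the limits of -2.00 and +2.00 are treated as inclusive, which is also an assumption. The preferred difficulty range is left out because it was a preference rather than a minimum requirement, and the X50 comparison is skipped for two-choice items, where both choices share one X50.

```python
from dataclasses import dataclass


@dataclass
class ChoiceStats:
    proportion: float  # share of subjects selecting this choice
    biserial: float    # biserial correlation with the criterion score
    x50: float         # X50, in standard deviation units
    beta: float        # slope index of the item characteristic curve


def correct_choice_passes(c):
    """Minimum requirements for the keyed (correct) choice."""
    return c.biserial > 0.30 and -2.00 <= c.x50 <= 2.00 and c.beta > 0.30


def incorrect_choice_passes(c, correct, min_proportion=0.05, two_choice=False):
    """Minimum requirements for one incorrect choice.

    min_proportion is an assumed threshold; the report gives no number.
    """
    x50_ok = two_choice or c.x50 < correct.x50
    return (
        c.proportion >= min_proportion
        and c.biserial < -0.25
        and x50_ok
        and c.beta < -0.25
    )


def item_passes(correct, incorrect, min_proportion=0.05):
    """True when every choice of the item meets its minimum requirements."""
    two_choice = len(incorrect) == 1
    return correct_choice_passes(correct) and all(
        incorrect_choice_passes(c, correct, min_proportion, two_choice)
        for c in incorrect
    )


# Hypothetical statistics for a three-choice item.
keyed = ChoiceStats(proportion=0.62, biserial=0.48, x50=-0.35, beta=0.55)
wrong = [
    ChoiceStats(proportion=0.24, biserial=-0.33, x50=-1.10, beta=-0.35),
    ChoiceStats(proportion=0.14, biserial=-0.29, x50=-1.60, beta=-0.30),
]
print(item_passes(keyed, wrong))
```

In the project the screen was applied twice where possible, once with total test score and once with subtest score as the criterion; running `item_passes` on both sets of statistics would mirror that.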
In a few cases where one or more choices
of an item were slightly deficient in meeting 
one or more of the standards but it was felt 
that the item was still basically good, slight 
revisions were made in the item. In so far as 
possible it was desired that all items meet 
these criteria on the basis of each of two analy- 
ses — one with total test score as the criterion 
ability and another with appropriate subtest 
score as the criterion ability. 

The difficulty of an item is indexed by the 
proportion of subjects who responded correctly 
to that item. Thus, the greater the value of 
the difficulty index the easier the item. An 
item of middle difficulty is defined by Ebel 
(1965) as one for which the proportion of correct
responses is halfway between the expected
chance proportion and 100%, and he further
states that items in a midrange of difficulty 
— 30% to 70% of the nonchance range — are 
almost as effective discriminators as are items 
of middle difficulty. This middle difficulty 
range was taken into consideration in defining 
desirable levels of difficulty for the items of 
WISTTRA. These levels are specified in Table 
3. In general, in assembling the total test the 
items were ordered by increasing level of dif- 
ficulty. 
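
The difficulty targets described above depend only on the number of choices per item. The following sketch works through that arithmetic under the definitions just given: chance proportion 1/k, middle difficulty halfway between chance and 1.00, and a midrange running from 30% to 70% of the nonchance range. The function name and the dictionary keys are hypothetical.

```python
def difficulty_targets(n_choices):
    """Chance level, middle difficulty point, and midrange bounds for an item."""
    chance = 1.0 / n_choices
    nonchance = 1.0 - chance          # range above chance performance
    return {
        "chance": chance,
        "midpoint": chance + 0.50 * nonchance,
        "low": chance + 0.30 * nonchance,
        "high": chance + 0.70 * nonchance,
    }


for k in (2, 3, 4, 5):
    t = difficulty_targets(k)
    print(
        f"{k} choices: chance {t['chance']:.3f}, midpoint {t['midpoint']:.3f}, "
        f"range {t['low']:.3f} to {t['high']:.3f}"
    )
```

For two, four, and five choices this arithmetic gives the same values as Table 3; for three choices the lower bound comes out at .533, which differs from the tabled .534 only in the third decimal place.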

For the biserial correlation, X50, and β
item statistics, both the results for total test
and appropriate subtest analyses were used.
An attempt was made to use only items that met
the standards on the basis of both analyses.
In a few cases where this was impossible,
the subtest analysis was the prime consideration.

The biserial correlation coefficient is an 
index of the discriminating ability of the item 
choice. For this analysis the criterion ability 
used was total test score. As with any correlation
there is no rigid standard for interpreting
a biserial correlation. Maximum correlations
were desired for WISTTRA and .30 was
set as a minimum for the correct choice. A
low biserial correlation means that the item 
is not discriminating across the criterion 
ability range — a student who had a poor cri- 
terion score would be almost as likely to get 
the item correct as one who had a good criterion 
score. The negative biserial correlation for the 
incorrect choices indicates a descending slope 
of the regression line from left to right. Thus,
poor students would be more likely to respond
to those choices than would good students.
The greater the absolute value of the correlation
the greater the discriminating power of
the item.
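
To make the sign behavior concrete, the sketch below computes a biserial correlation for one choice of an item using the common textbook formula, r = ((M1 - M0) / s) x (pq / y), where M1 and M0 are the mean criterion scores of those who did and did not select the choice, s is the standard deviation of all criterion scores, p and q are the proportions selecting and not selecting the choice, and y is the ordinate of the standard normal curve at the point dividing p from q. Whether the analysis program used exactly this computational form is not stated in the report, so the formula, the function name, and the data are assumptions for illustration.

```python
from statistics import NormalDist, mean, pstdev


def biserial(chose, criterion):
    """Biserial correlation between selecting a choice and the criterion score.

    chose: 1 if the student selected the choice, else 0.
    criterion: criterion scores (e.g. total test scores), same order.
    Requires that some, but not all, students selected the choice.
    """
    p = mean(chose)
    q = 1.0 - p
    m1 = mean(s for c, s in zip(chose, criterion) if c)
    m0 = mean(s for c, s in zip(chose, criterion) if not c)
    sd = pstdev(criterion)
    # Ordinate of the unit normal curve at the p/q division point.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (m1 - m0) / sd * (p * q / y)


# Hypothetical two-choice item answered by eight students.
total_scores = [10, 9, 8, 7, 6, 5, 4, 3]
chose_correct = [1, 1, 0, 1, 0, 1, 0, 0]
chose_incorrect = [1 - c for c in chose_correct]

print(round(biserial(chose_correct, total_scores), 3))
print(round(biserial(chose_incorrect, total_scores), 3))
```

For a two-choice item the incorrect choice is simply the complement of the correct one, so its biserial correlation has the same magnitude with the opposite sign, which is the descending slope described above.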

X50 is the point on the criterion scale,
given in standard deviation units, corresponding
to the median of the item characteristic
curve and is the point at which the item choice
has maximum discrimination. Figure 2 illustrates
a typical item characteristic curve.
Subjects with a criterion score equal to X50
have a 50-50 chance of choosing that response.
Thus, +2.00 and -2.00 were used as desirable
limits as this range would include approximately
95% of the cases. It was essential, for an item
to be retained in the test, that the X50 value
for all the incorrect choices be less than that
for the correct choice, with the exception of
two-choice items. For two-choice items the
X50 value is the same for both choices.

β can be thought of as the slope of the item
characteristic curve at the X50 point and is an
index of the discrimination power of the item.
The higher the β value the greater the slope
of the curve and the more clearly the item is
discriminating. The maximum positive βs were
desired for the correct choice and negative ones
required for all incorrect choices.
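
The roles of the two curve parameters can be seen in a small numerical sketch. It assumes the item characteristic curve is a normal ogive, P(x) = Phi(beta x (x - X50)), where Phi is the standard normal distribution function; the report does not spell out the functional form, so this is an assumption, and the function name and parameter values are hypothetical. Under this form the probability of selecting the choice is exactly .50 at X50, and the slope of the curve at that point is proportional to beta.

```python
from statistics import NormalDist

_UNIT_NORMAL = NormalDist()


def choice_probability(x, x50, beta):
    """Probability of selecting a choice at criterion level x (SD units).

    Assumes a normal-ogive item characteristic curve.
    """
    return _UNIT_NORMAL.cdf(beta * (x - x50))


# Hypothetical two-choice item: the correct choice has a positive beta and
# the incorrect choice the same X50 with a negative beta.
x50, beta = 0.40, 1.20
for x in (-2.0, -1.0, 0.0, 0.4, 1.0, 2.0):
    p_correct = choice_probability(x, x50, beta)
    p_incorrect = choice_probability(x, x50, -beta)
    print(f"x = {x:+.1f}: correct {p_correct:.3f}, incorrect {p_incorrect:.3f}")
```

With a larger beta the curve rises more steeply around X50, so students just below and just above that point are separated more sharply; with a negative beta the curve falls, as required of an incorrect choice.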

SUMMARY 

In developing WISTTRA the researchers attempted
to structure an instrument capable of




Table 3

Preferable Difficulty Levels for WISTTRA

Test                  Number of    Chance         Middle Difficulty    Middle Difficulty
                      Choices      Probability    Point                Range

TI, TII, TIII, RII    2            .500           .750                 .650 to .850
RIV                   3            .333           .667                 .534 to .800
RIII                  4            .250           .625                 .475 to .775
RI                    5            .200           .600                 .440 to .760
Figure 2. A Typical Item Characteristic Curve
Reproduced from Baker, 1965. 



stable, precise discriminations across the 
broad ability ranges of high school students. 
At all stages standard procedures and criteria 
were employed to insure as much as possible 
that the final test battery would perform 
according to these expectations. Ideals of 
this sort must, however, be tempered by the 
demands of interestingness, practicability,



and content validity, as well as by the fact 
that cognitive skills tend to elude precise 
measurement. This section has attempted to 
convey the criteria and considerations which 
the investigators employed to balance the 
often conflicting demands such a testing in- 
strument must satisfy. 
IV

DISCUSSION OF SPECIFIC TESTS 



This section of the paper presents a discus- 
sion of WISTTRA on a test by test basis pro- 
viding a statement of each test's objectives 
and a brief description of each test as it exists 
in its latest edition. 



TESTIMONY I: Appraising Testimony in 

Terms of Internal Criteria 

Objective 

Testimony I is designed to measure a stu- 
dent's ability to use the internal tests of testi- 
mony to discriminate between reliable and un- 
reliable instances of testimony. 

Structure of the Test 

Testimony I consists of 60 two-choice items. 
Each item presents the name of a source (per- 
son, office, publication, etc.) and a statement 
(generalization, quality judgment, statistic, 
etc.) made by that source on some particular 
topic. Consider the following two examples: 

1. Senate Reporter: Sixty-five percent of the
   Republican Senators voted
   for the Smith-Doe Bill.

2. High School Student: Sun Village is an excellent
   example of modern literature.

Students are asked to indicate whether they 
would accept or reject that statement on the 
grounds that it was made by that source. Four 
criteria for acceptance are represented in the 
items: (1) Is the source in a position to observe?
(2) Is the source competent to observe?
(3) Is the source unbiased? and (4) Is the source
qualified to judge? Positive (accept) items,
such as Example 1 above, are those which
meet all four criteria; negative (reject) items
meet three of the four, but do not fulfill the
fourth criterion (e.g., in position, competent,
unbiased, but not qualified to judge), as in
Example 2 above. A student responding cor- 
rectly on all items would accept 20 instances 
and reject 40. Violations of internal criteria 
are distributed across reject items in a bal- 
anced fashion so that each criterion is repre- 
sented by 10 reject items. 

There are indications that the five subtests 
of Testimony I should be kept as individual 
tests and not grouped into one composite test. 
Further study on the dimensionality of the tests 
is in progress and will be reported in A Factor 
Analytic Study of the Wisconsin Tests of Testimony
and Reasoning Assessment (Harris, 1969).

TESTIMONY II: Appraising Testimony in 

Terms of External Criteria: 

Consistency with Other
Testimony 

Objective 

Testimony II is designed to measure a stu- 
dent's ability to recognize inconsistency be- 
tween two instances of testimony. 

Structure of the Test 

Testimony II consists of 20 two-choice 
items. Each item presents two similar state- 
ments attributed to the same source or to dif- 
ferent sources. Two examples are: 

1A. Sam, Pro Golf 

Official: The greens are in good 
condition for today's 
match. 

B. Sam, Pro Golf

Official: Even the best of the
golfers are complaining
about the rough spots on
the greens in today's
match.
2A. American League

President: Public interest in base- 
ball has declined in 
the last year. 

B. National League 

President: All major league clubs 

have had sizable in- 
creases in attendance 
over the past year. 

Students are asked to compare the two instances 
of testimony and determine whether accepting 
the second instance would make them more or 
less likely to accept the first. In the two ex- 
amples above, the second instance clearly 
makes the first instance less likely. These 
items represent the single criterion of consis- 
tency between instances of testimony using 
statements by the same source or statements 
by different sources. Testimony II presents 
10 consistent and 10 inconsistent pairs of 
testimony. Both consistent and inconsistent 
items are balanced so that single-source pairs 
and two-source pairs appear with the same 
frequency. 



TESTIMONY III: Appraising Testimony in 

Terms of External Criteria: 

Recency and Proximity 

Objective 

Testimony III is designed to measure a stu- 
dent's ability to use the external tests of prox- 
imity and recency to discriminate between re- 
liable and unreliable instances of testimony. 

Structure of the Test 

Testimony III consists of 40 two-choice 
items. Each item presents a pair of similar 
statements attributed to different sources. 
Proximity differences (primary versus secondary 
information) are contrasted within 20 of the 40 
items by structuring one member of the pair as 
a reference to some other source and the sec- 
ond member of the pair as a source of the sort 
referred to by the first. For example: 

1A. Health Inspector: The pool's chlorine 

content was too high 
today when I tested 
it. 

B. Pool Director: I was told that today's 

tests indicated the con- 
centration of chlorine in 
the water to be above 
the suggested limit. 

In this case, the health inspector's statement 
is to be preferred since he is a primary source 



reporting a direct observation whereas the pool 
director is a secondary source who is merely 
reporting what someone else told him. Recency 
differences (recent vs. out-of-date information)
are contrasted in the remaining 20 items by in- 
cluding time references which indicate that 
one statement of the pair is more recent than 
the other. For example: 

2A. Teacher A: As we begin the 1966-67 school year we can
    expect more than 75 books to disappear from the library.

 B. Teacher B: Less than fifty books were missing from the
    library when the 1966-67 school year ended.

In this case, when the subject in question is 
the number of books disappearing during a 
school year, the testimony of Teacher B is to 
be preferred since it is the more recent. In 
order to measure a student's ability to handle 
these criteria in realistic settings, each of the 
subtests presents 10 pairs of consistent in- 
stances of testimony and 10 pairs of incon- 
sistent instances. The student is asked to 
select the best instance of testimony independ- 
ent of any inclination to agree with either of 
the sources.

A previous preliminary study indicates that
Testimony III consists of two somewhat independent
subtests. Reliability and item characteristic
information will be given for Testimony III as a
composite and as two separate subtests. Further
study on the dimensionality of the tests is in
progress.

REASONING I: Recognizing and Selecting Warrants in Arguments

Objective 

Reasoning I is designed to measure a stu- 
dent's ability to recognize the absence of war- 
rants and to select appropriate warrants when 
needed. 

Structure of the Test 

Reasoning I consists of 28 five-option multiple
choice items. Each item presents an argument based
on reasoning which is either complete (data,
warrant, and claim) or incomplete (data and claim,
but irrelevant information instead of warrant).
Four possible warrants and a none-needed option are
given for each item. For example:

Mr. Henswar, who moved into our neighborhood
last week, is a judge. He is one of the new
judges I have never seen. It can be concluded
that Mr. Henswar is a dignified man.

A. Mr. Henswar is a typical newly appointed judge.

B. Repeated presence in court is a sign of dignity
   in a man.

C. Judges are dignified men.

D. Mr. Henswar is more dignified than our previous
   judge.

E. None needed. 



Students are asked to mark option E if the
argument is complete as it stands in the initial
paragraph. If not, they are to select the
appropriate warrant from options A-D. For instance,
since the warrant is absent in the above example,
the student should select response C, which
provides an appropriate inference license for a
class argument. A student responding correctly to
all questions will select E for 10 of the 28 items.
Each item represents one of seven argument types:
sign, cause, class, comparative, parallel case,
alternative, and warrant-supportive. Each of the
seven argument types is represented by four items,
two or three of which require completion by
appropriate warrant selection. In addition, an
effort was made to distribute warrant types among
the distractor options in a balanced fashion. Thus,
in the above example, the incorrect responses A, B,
and D are warrants for warrant-supportive, sign,
and comparative arguments respectively.
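
The item counts in this paragraph can be checked with a few lines of arithmetic. The Python sketch below is illustrative only and is not part of the test materials; the names ARGUMENT_TYPES and feasible_splits are hypothetical. It assumes, as stated above, that every argument type has four items of which two or three are incomplete, and asks which split is consistent with 10 items keyed E.

```python
# Hypothetical blueprint check for Reasoning I (names are illustrative).
ARGUMENT_TYPES = [
    "sign", "cause", "class", "comparative",
    "parallel case", "alternative", "warrant-supportive",
]
ITEMS_PER_TYPE = 4
NONE_NEEDED_TOTAL = 10  # items whose keyed answer is E

total_items = len(ARGUMENT_TYPES) * ITEMS_PER_TYPE        # 28
needs_warrant_total = total_items - NONE_NEEDED_TOTAL     # 18


def feasible_splits():
    """Return (types with 3 incomplete items, types with 2) pairs
    that add up to the number of items needing a warrant."""
    splits = []
    n_types = len(ARGUMENT_TYPES)
    for n_three in range(n_types + 1):
        n_two = n_types - n_three
        if 3 * n_three + 2 * n_two == needs_warrant_total:
            splits.append((n_three, n_two))
    return splits


if __name__ == "__main__":
    print(total_items, needs_warrant_total, feasible_splits())
```

Under these assumptions the only consistent split is four argument types with three incomplete items and three types with two. The report does not say which types fall in which group.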



REASONING II: Recognizing Statements Which Answer Reservations in Arguments

Objective

Reasoning II is designed to measure the 
student's recognition of statements in argu- 
ments which anticipate and answer reservations. 

Structure of the Test 

Reasoning II consists of 28 two-choice 
items. Each item presents a pair of complete 
arguments containing data, warrant, claim, 
and some additional information. The student 
is instructed to indicate which of the two argu- 
ments is better. For example: 

A. George is probably a good young farmer
because he is a member of our school's
Future Farmers of America Club. Last
year he won four blue ribbons at the
county fair with his dairy cattle. Our
county agricultural agent says that
membership in the FFA is a pretty good sign
that a boy is a good young farmer.

B. George is probably a good young farmer 
because he is a member of our school's 
Future Farmers of America Club. The 
Club has twenty-two members and meets 
every Saturday morning in the Agricul- 
tural Lab. Our county agricultural agent 
says that membership in the FFA is a 
pretty good sign that a boy is a good 
young farmer. 

The paired arguments are alike except that in 
one of them the additional information is ir- 
relevant to the argument while in the other the 
information removes or answers a possible 
refutation (reservation) of the argument. Thus, 
in the above example, the first argument is to 
be preferred to the second because the first 
contains information which answers the "lack 
of concurrent sign" reservation. All seven 
argument types are represented equally in the 
items. Reasoning II contains four items for 
each of the seven argument types: sign, 
cause, class, comparative, parallel case, 
alternative, and warrant-supportive. 



REASONING III: Selecting Reservations in Arguments



Objective 

Reasoning III is designed to measure the 
student's ability to discriminate between rele- 
vant and irrelevant reservations. 

Structure of the Test 

Reasoning III consists of 28 four-option 
multiple choice items. Each item presents a 
complete argument (data, warrant, and claim) 
and four statements to be considered as pos- 
sible reservations to the argument. For ex- 
ample: 

Enrollment in our high school has been 
steadily increasing. Since increases in 
enrollment force school boards to build 
new high schools, our school board will 
probably build another high school soon. 

A. Unless there is still plenty of room in
   the old high school.

B. Unless all school boards face increasing
   enrollments.

C. Unless our school system has fewer
   students than many other systems which
   built new high schools.

D. Unless the relationship between enrollments
   expressed by the word "increased" does not
   imply future changes in enrollment.

The student is asked to select from the four the 
reservation which best qualifies or refutes the 
argument. In each case only one of the four 
choices is a relevant reservation to the argument
type represented. The other three responses,
although appearing as reservations in terms of 
phrasing, do not lessen the confidence which 
may be placed in the claim advanced. Thus, 
answer A in the example above represents the 
partial cause reservation to a cause-effect argu- 
ment and is therefore the appropriate response 
while the other three responses simply provide 
seemingly relevant information in a reserva- 
tion form inappropriate to a causal argument. 

Reasoning III presents four items for each of 
the seven argument types. This format enabled 
the investigators to use each of the various res- 
ervation types at least once in connection with 
an appropriate argument type and to roughly 
balance the distribution of reservation types 
across distractor choices. 

REASONING IV: Selecting Claims in Arguments 

Objective 

Reasoning IV is designed to measure a stu- 
dent's ability to select the claim appropriate 
to a given argument. 

Structure of the Test 

Reasoning IV consists of 28 three-option
multiple choice items. Each item presents the
data, warrant, and some additional information
for an argument. The items are systematically
varied such that the additional information is
irrelevant in 9 items, provides an answer to a
reservation in 10 other items, and raises a
reservation (with no answer given) in the
remaining 9 items. The student is instructed to
select the proper claim to the argument (of
two choices given) or to indicate that it is not
possible to make a proper claim given the
information presented in the argument. For example:

Johnny always turns his work in on time. 
Turning work in on time plays an important 
part in passing college courses. Sometimes 
Johnny spends little time on his work and 
does not care much about studying. There- 
fore: 

A. Johnny probably will pass his college
   courses.

B. You really can't tell whether Johnny
   will pass his college courses.

C. Johnny probably does careful work in
   his courses.

The three possible answers are constructed 
such that one states a topically related idea 
which does not follow the structure of the 
data and warrant, one denies that a particular 
conclusion is possible, and one presents a 
straightforward claim. A student responding 
correctly would select the "cannot tell" op- 
tion for items with unanswered reservations 
(as in the example above) and the straight- 
forward claim in all other cases. Again each 
of the argument types is represented by four
items.
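
The scoring rule described above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the investigators' scoring procedure: the category labels and function names are hypothetical, and only the item counts (9, 10, and 9) come from the text.

```python
from collections import Counter

# Number of Reasoning IV items by kind of additional information
# (counts from the text; the labels themselves are hypothetical).
EXTRA_INFO_COUNTS = {
    "irrelevant": 9,
    "answers_reservation": 10,
    "raises_reservation": 9,
}


def keyed_response(extra_info):
    """Return the kind of option keyed correct for an item.

    An unanswered reservation calls for the "cannot tell" option;
    every other item calls for the straightforward claim.
    """
    if extra_info == "raises_reservation":
        return "cannot_tell"
    return "straightforward_claim"


def key_distribution(counts):
    """Tally how often each kind of option is the keyed response."""
    tally = Counter()
    for extra_info, n_items in counts.items():
        tally[keyed_response(extra_info)] += n_items
    return tally


if __name__ == "__main__":
    print(dict(key_distribution(EXTRA_INFO_COUNTS)))
```

On these counts the "cannot tell" option is keyed for 9 items and the straightforward claim for the remaining 19, so the topically related option is never correct.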
V

RELIABILITY ESTIMATES AND ITEM STATISTICS



The reliability estimates and item statistics 
reported in this section were obtained in the 
normative study. In all cases, they are given 
separately for each sex group for each grade 
(seven through twelve).



SAMPLE 

The total number of subjects tested ranged
from 3090 to 3118 for any one test. The total
number of subjects within a single age and
sex group ranged from 190 to 311 for any one
test. These subjects were obtained by randomly
sampling schools from a single stratification
of the population of Wisconsin school
districts. This was accomplished by using
the results of a study by Miller et al. (1967)
which describes Wisconsin school districts
on the basis of factor scores for a number of
factors. The following five factors were used
in identifying a homogeneous stratified
population for the study: (1) numerical size,
(2) organizational complexity, (3) teacher
experience, (4) economic power, and (5) size
of school unit. For further details on the
population and sampling procedures used refer to
A Study of Student Abilities in the Evaluation
of Verbal Argument (Rott, Feezel, & Allen, in
press).



RELIABILITY ESTIMATES 

Reliability estimates were computed using 
the Hoyt analysis of variance procedures and 
were obtained as part of the results of the Gen- 
eralized Item and Test Analysis Program (Baker, 
1968) used to analyze the tests. These estimates
are presented in Tables 4 through 17 for
each of the seven tests as a total test and for
subtests of Testimony I and Testimony III. Also 
included in these tables, for each grade and sex 
group, is the sample size, mean, standard devi- 
ation, range, and standard error of measurement. 
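
For readers unfamiliar with the procedure, Hoyt's estimate can be sketched as follows. This is a minimal Python illustration of the standard analysis-of-variance formula, reliability = 1 - MS(residual) / MS(persons), together with the usual relation SEM = SD * sqrt(1 - reliability). It is not the code of the Generalized Item and Test Analysis Program; the function names and the 0/1 score-matrix layout are assumptions made for the example.

```python
from math import sqrt


def hoyt_reliability(scores):
    """Hoyt analysis-of-variance reliability.

    scores: one list per student, each holding that student's
    item scores (1 = correct, 0 = incorrect), all of equal length.
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    grand_total = sum(sum(row) for row in scores)
    correction = grand_total ** 2 / (n_persons * n_items)

    ss_total = sum(x * x for row in scores for x in row) - correction
    ss_persons = sum(sum(row) ** 2 for row in scores) / n_items - correction
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    ss_items = sum(t * t for t in item_totals) / n_persons - correction
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n_persons - 1)
    ms_residual = ss_residual / ((n_persons - 1) * (n_items - 1))
    return 1 - ms_residual / ms_persons


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from a score SD and a reliability."""
    return sd * sqrt(1 - reliability)


if __name__ == "__main__":
    # Tiny made-up score matrix: 5 students by 4 items.
    demo = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0],
            [1, 1, 1, 1], [0, 0, 0, 0]]
    print(hoyt_reliability(demo))
    print(standard_error_of_measurement(7.44, 0.78))
```

Applied to the rounded values in the first row of Table 4 (standard deviation 7.44, reliability .78), the second function gives roughly 3.49, which agrees with the tabled 3.48 to within rounding of the inputs.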



ITEM STATISTICS 

A summary of the item statistics (difficulty,
biserial correlation, X50, and β) for the correct
choices for each of the seven tests as a total
test and for subtests of Testimony I and
Testimony III is given in Tables 18 through 31.
The investigators realize there are problems
with using the mean as a measure of central
tendency for the biserial correlation and β
since they are not linear, but it was felt the
mean would give the reader some indication of
central tendency and at least show the general
increase in the value of these statistics from
grade seven through grade twelve.
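
The relation between these statistics can be illustrated with the textbook normal-ogive formulas. The Python sketch below is an assumption about how X50 and β relate to item difficulty and the biserial correlation; the report does not give the formulas used by the analysis program, and the function name is hypothetical. It takes difficulty to be the proportion of students answering correctly, β = r / sqrt(1 - r^2), and X50 = γ / r, where γ is the normal deviate that cuts off the upper proportion equal to the difficulty.

```python
from math import sqrt
from statistics import NormalDist


def ogive_parameters(difficulty, biserial):
    """Return (x50, beta) under the normal-ogive relations above.

    difficulty: proportion of students answering correctly (0 < p < 1).
    biserial:   item-criterion biserial correlation.
    Returns (None, None) when the formulas cannot be applied, for
    example when the biserial correlation reaches or exceeds 1.00.
    """
    if not 0 < difficulty < 1:
        return None, None
    if biserial == 0 or abs(biserial) >= 1:
        return None, None
    gamma = NormalDist().inv_cdf(1 - difficulty)
    x50 = gamma / biserial          # ability level with a 50% chance of success
    beta = biserial / sqrt(1 - biserial ** 2)  # discrimination index
    return x50, beta


if __name__ == "__main__":
    print(ogive_parameters(0.50, 0.60))
    print(ogive_parameters(0.70, 1.05))
```

Because β involves sqrt(1 - r^2), it is undefined once a biserial correlation reaches 1.00. Since β is also a nonlinear function of r, a simple mean of these statistics is only a rough indication of central tendency, as the investigators caution.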
Table 4

Testimony I: Reliability Estimates for the Total Test

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (60 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   37.89        7.44        18-58       3.48             .78
7F          251   38.96        6.62        12-56       3.42             .73
8M          228   39.78        7.85        24-54       3.38             .81
8F          224   40.87        7.06        25-55       3.29             .78
9M          304   41.00        8.04        13-56       3.29             .83
9F          302   42.22        7.23        25-57       3.19             .80
10M         287   43.69        8.04        6-58        3.08             .85
10F         302   44.53        6.70        23-57       2.97             .80
11M         253   45.45        6.84        23-57       2.92             .81
11F         265   44.38        6.45        16-56       2.96             .79
12M         190   44.98        8.09        22-59       2.97             .86
12F         253   45.30        6.73        12-57       2.88             .81



Table 5

Testimony I: Reliability Estimates for the Accept Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (20 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   14.22        3.90        0-20        1.80             .78
7F          251   15.00        3.23        5-20        1.75             .69
8M          228   15.00        3.50        7-20        1.74             .74
8F          224   15.65        3.16        7-20        1.64             .72
9M          304   15.42        3.57        6-20        1.66             .77
9F          302   16.19        3.05        5-20        1.54             .73
10M         287   16.43        3.46        7-20        1.47             .80
10F         302   17.14        2.48        8-20        1.37             .68
11M         253   17.50        2.61        7-20        1.31             .74
11F         265   17.38        2.53        6-20        1.31             .72
12M         190   17.14        3.21        7-20        1.37             .81
12F         253   17.68        2.41        4-20        1.23             .72
Table 6

Testimony I: Reliability Estimates for the Bias Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (10 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   5.16         2.17        0-9         1.32             .59
7F          251   5.19         2.14        1-10        1.29             .60
8M          228   5.48         2.05        0-10        1.34             .53
8F          224   5.38         2.28        0-10        1.26             .66
9M          304   5.53         2.15        0-10        1.33             .57
9F          302   5.49         2.12        0-10        1.27             .60
10M         287   5.86         2.13        0-10        1.27             .61
10F         302   5.57         2.20        1-10        1.21             .66
11M         253   6.00         2.21        1-10        1.21             .67
11F         265   5.62         2.19        0-10        1.20             .67
12M         190   5.91         2.15        0-10        1.26             .61
12F         253   5.74         2.31        0-10        1.19             .70



Table 7

Testimony I: Reliability Estimates for the Position Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (10 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   5.65         1.91        0-10        1.41             .39
7F          251   5.59         2.00        1-10        1.40             .46
8M          228   5.91         1.92        0-10        1.40             .41
8F          224   5.91         1.90        1-10        1.37             .43
9M          304   6.19         1.88        2-10        1.36             .42
9F          302   6.36         2.09        0-10        1.32             .56
10M         287   6.50         1.96        1-10        1.30             .51
10F         302   6.66         1.90        1-10        1.27             .50
11M         253   6.47         1.92        1-10        1.29             .50
11F         265   6.42         1.93        2-10        1.28             .51
12M         190   6.54         2.08        1-10        1.26             .59
12F         253   6.67         2.03        1-10        1.22             .60
Table 8

Testimony I: Reliability Estimates for the Competence Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (10 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   6.55         2.03        1-10        1.34             .52
7F          251   6.34         2.11        1-10        1.35             .55
8M          228   6.75         2.01        2-10        1.32             .52
8F          224   6.88         2.07        1-10        1.29             .57
9M          304   6.94         2.04        2-10        1.28             .57
9F          302   6.88         2.18        1-10        1.26             .63
10M         287   7.45         2.01        2-10        1.20             .60
10F         302   7.30         2.13        2-10        1.19             .65
11M         253   7.63         1.89        2-10        1.16             .58
11F         265   7.09         2.10        2-10        1.22             .62
12M         190   7.55         1.96        1-10        1.18             .60
12F         253   7.32         1.97        2-10        1.20             .58








Table 9

Testimony I: Reliability Estimates for the Qualification Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (10 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   6.31         2.01        0-10        1.37             .48
7F          251   6.84         1.75        2-10        1.33             .36
8M          228   6.65         2.02        2-10        1.32             .52
8F          224   7.05         1.97        3-10        1.28             .53
9M          304   6.91         1.98        2-10        1.28             .53
9F          302   7.30         1.90        2-10        1.24             .53
10M         287   7.45         1.92        2-10        1.21             .56
10F         302   7.88         1.73        3-10        1.12             .53
11M         253   7.86         1.67        3-10        1.14             .48
11F         265   7.89         1.72        3-10        1.13             .52
12M         190   7.85         1.91        3-10        1.12             .61
12F         253   7.88         1.72        2-10        1.12             .53
Table 10

Testimony II: Reliability Estimates

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (20 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   13.14        3.83        0-20        1.93             .73
7F          251   12.77        3.86        4-20        1.95             .73
8M          228   12.67        4.32        4-20        1.92             .79
8F          224   13.16        4.26        6-20        1.88             .79
9M          304   12.75        4.36        4-20        1.91             .80
9F          302   13.26        4.51        4-20        1.84             .82
10M         287   13.33        4.89        2-20        1.80             .86
10F         302   15.14        4.32        5-20        1.64             .85
11M         253   14.60        4.58        0-20        1.69             .86
11F         265   14.83        4.79        3-20        1.63             .88
12M         190   14.52        4.76        6-20        1.68             .87
12F         253   14.98        4.69        0-20        1.61             .88







Table 11

Testimony III: Reliability Estimates for the Total Test

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (40 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   23.59        5.58        12-37       2.93             .72
7F          251   23.55        5.33        14-37       2.90             .70
8M          227   24.04        6.14        5-40        2.89             .77
8F          223   25.25        5.29        14-38       2.80             .71
9M          303   24.40        5.99        11-38       2.86             .77
9F          305   25.68        5.72        8-38        2.77             .76
10M         288   26.28        6.54        3-40        2.73             .82
10F         311   27.05        5.83        16-39       2.65             .79
11M         256   27.33        6.66        0-40        2.64             .84
11F         262   28.36        6.07        10-39       2.55             .82
12M         195   26.82        6.42        13-39       2.70             .82
12F         251   27.69        6.33        4-40        2.56             .83
Table 12

Testimony III: Reliability Estimates for the Recency Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (20 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   12.57        3.32        4-20        2.00             .62
7F          251   12.88        3.42        7-20        1.97             .65
8M          227   12.73        3.61        6-20        1.97             .69
8F          223   13.73        3.44        7-20        1.86             .69
9M          303   12.91        3.64        5-20        1.95             .70
9F          305   13.90        3.36        7-20        1.85             .68
10M         288   14.15        3.64        7-20        1.83             .73
10F         311   14.69        3.21        6-20        1.75             .71
11M         256   14.54        3.70        0-20        1.76             .76
11F         262   15.40        3.28        7-20        1.66             .73
12M         195   14.30        3.64        6-20        1.79             .75
12F         251   14.97        3.25        7-20        1.69             .72



Table 13

Testimony III: Reliability Estimates for the Proximity Subtest

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (20 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   11.02        3.22        3-20        2.06             .57
7F          251   10.67        3.07        4-19        2.05             .53
8M          227   11.31        3.53        5-20        2.03             .65
8F          223   11.52        3.09        4-19        2.01             .56
9M          303   11.49        3.40        4-20        2.02             .63
9F          305   11.78        3.42        2-19        1.98             .65
10M         288   12.13        3.87        1-20        1.94             .73
10F         311   12.63        3.71        2-20        1.90             .72
11M         256   12.79        3.86        0-20        1.89             .74
11F         262   12.96        3.81        4-20        1.84             .75
12M         195   12.52        3.64        4-20        1.94             .70
12F         251   12.72        3.99        4-20        1.83             .78
Table 14

Reasoning I: Reliability Estimates

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (28 max.)    Deviation   of Scores   of Measurement   Reliability

7M          245   8.24         5.00        0-24        2.18             .80
7F          248   8.97         4.87        0-25        2.24             .78
8M          230   9.41         5.58        1-28        2.23             .83
8F          218   10.79        5.99        3-27        2.24             .85
9M          302   9.69         5.77        1-25        2.23             .84
9F          301   11.61        6.57        1-27        2.24             .88
10M         277   12.90        6.89        1-28        2.25             .89
10F         294   14.48        7.48        2-28        2.16             .91
11M         262   14.08        7.58        1-28        2.18             .91
11F         270   15.21        7.48        1-28        2.16             .91
12M         191   13.86        7.84        1-28        2.16             .92
12F         252   15.70        7.16        3-28        2.18             .90



Table 15

Reasoning II: Reliability Estimates

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (28 max.)    Deviation   of Scores   of Measurement   Reliability

7M          246   15.94        3.91        7-26        2.42             .60
7F          251   17.24        4.16        9-27        2.37             .66
8M          227   17.76        4.40        8-28        2.38             .70
8F          223   18.25        4.43        9-27        2.30             .72
9M          303   18.00        4.87        8-28        2.32             .76
9F          305   19.55        4.62        7-28        2.17             .77
10M         288   18.88        4.95        6-27        2.14             .81
10F         311   21.00        4.85        9-28        2.00             .82
11M         256   20.81        5.19        0-28        2.05             .84
11F         262   21.50        4.53        9-28        1.94             .81
12M         195   21.02        5.15        9-28        2.03             .84
12F         251   21.90        4.67        5-28        1.86             .84
Table 16

Reasoning III: Reliability Estimates

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (28 max.)    Deviation   of Scores   of Measurement   Reliability

7M          245   12.44        5.67        3-27        2.34             .82
7F          248   13.77        5.24        3-26        2.35             .80
8M          230   13.89        6.13        3-28        2.31             .85
8F          218   15.72        5.89        4-26        2.27             .85
9M          302   14.66        6.36        3-26        2.27             .87
9F          301   16.57        5.92        0-27        2.24             .85
10M         277   17.55        6.22        3-28        2.18             .87
10F         294   19.14        5.42        5-28        2.10             .84
11M         262   18.68        6.25        0-28        2.11             .88
11F         270   19.53        5.63        2-28        2.06             .86
12M         191   18.03        6.71        4-28        2.12             .90
12F         252   20.59        4.94        5-28        2.00             .83



Table 17

Reasoning IV: Reliability Estimates

Grade/Sex   N     Mean Score   Standard    Range       Standard Error   Hoyt
                  (28 max.)    Deviation   of Scores   of Measurement   Reliability

7M          245   14.39        4.97        4-27        2.40             .76
7F          248   15.41        4.73        5-27        2.39             .74
8M          230   16.06        4.81        5-25        2.37             .75
8F          218   16.94        4.86        5-27        2.33             .76
9M          302   16.60        5.25        4-27        2.30             .80
9F          301   17.82        5.09        4-28        2.25             .80
10M         277   18.72        5.12        5-28        2.21             .81
10F         294   19.73        4.51        6-28        2.15             .77
11M         262   19.23        4.96        3-28        2.17             .80
11F         270   20.17        4.35        6-28        2.10             .76
12M         191   19.04        5.63        5-28        2.14             .85
12F         252   21.16        3.88        10-28       2.03             .72
Testimony I: Item Statistics for the Accept Subtest

(20 Items)

The X50 could not be computed for the item for which the biserial correlation exceeded 1.00.
The highest β could not be computed since the highest biserial correlation exceeded 1.00.
Based on 19 items.



Testimony III: Item Statistics for the Total Test 
VI

CONCLUSIONS 



The reliability estimates obtained for all of
the tests for each age and sex group are
sufficient for research purposes and to evaluate
group differences. In addition, for some of the
tests, particularly for Grades 10-12, the
reliabilities are of a sufficient magnitude
to permit evaluation of differences among
individuals. If the further study of the
dimensionality of the tests indicates that the
subtests of Testimony I and Testimony III should
be considered as independent tests they should
be lengthened to be more reliable.
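
How much lengthening would be needed can be estimated with the Spearman-Brown formula. The report does not name this formula; it is the standard tool for the purpose and is used here as an illustration. The Python sketch below, with hypothetical function names, assumes that added items would be parallel to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability
            / (1 + (length_factor - 1) * reliability))


def length_factor_needed(reliability, target):
    """Factor by which a test must be lengthened to reach target."""
    return target * (1 - reliability) / (reliability * (1 - target))


if __name__ == "__main__":
    # Position subtest, Grade 7 males (Table 7): 10 items, reliability .39.
    factor = length_factor_needed(0.39, 0.80)
    print(factor, spearman_brown(0.39, factor))
```

For example, Table 7 reports a reliability of .39 for seventh-grade males on the 10-item Position subtest. Under the parallel-items assumption, reaching .80 would take a subtest a little over six times as long, or about 63 items.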

The items, in general, exhibit the
characteristics sought by the investigators. Many of
the items fall within the middle difficulty range.
Most items discriminate rather sharply, as
indexed by high biserial correlations and βs.

Most of the items which have low biserial
correlations and βs are found in one of two tests,
Testimony I or Testimony III, when total test
score is the criterion measure. These low
correlations may be indications that at least some
items are measuring different abilities and that
subtests should perhaps be retained. Most of
these same items have correlations and βs above
.30 for the appropriate subtest when it is the
criterion measure. As evidenced by the X50 item
statistics, many more items are maximally
discriminating among students of low and middle
abilities than among students of high ability.
Thus, these items are discriminating more clearly
among less able students than they are among
more able students. In general, the item
statistics tend to increase in value from Grade 7
to Grade 12.

Although the final edition of the tests was
designed primarily for Grades 10-12, there are
indications that the tests might also yield useful
information for Grades 7-9. A more exact
interpretation of the adequacy of the reliability and
item statistics of the tests is left to the reader
and potential user who should judge the value
of the tests for his particular purpose.